OCPBUGS-84577: clear stale EtcdRecoveryActive failure condition when etcd is healthy#8406

Open
vsolanki12 wants to merge 1 commit into openshift:main from vsolanki12:OCPBUGS-84577-etcd-recovery-stale-condition

Conversation

@vsolanki12
Contributor

@vsolanki12 vsolanki12 commented May 4, 2026

What this PR does / why we need it:

When the etcd recovery job fails but etcd self-heals, the EtcdRecoveryActive failure condition (reason EtcdRecoveryJobFailed) was never cleared. This caused the OpenShift Console to display a stale error message ("Error in Etcd Recovery job: the Etcd cluster requires manual intervention.") on the HostedCluster overview page, even when the cluster was fully healthy (Available=True, Degraded=False, EtcdAvailable=True).

This fix adds two checks in reconcileETCDMemberRecovery:

  • When a failed recovery job exists but the etcd StatefulSet is fully available (3/3 replicas), clean up the job and clear the condition.
  • When no failing etcd pods exist and etcd is healthy, clear any stale EtcdRecoveryJobFailed condition.
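The two checks can be sketched as follows. This is a minimal, self-contained illustration, not the actual HyperShift controller code: the `statefulSetStatus` and `condition` types, the helper names, and the string constant values are stand-ins for the real Kubernetes/HyperShift types and reason constants.

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes/HyperShift types involved;
// field names and reason strings mirror the PR description but are illustrative.
type statefulSetStatus struct {
	ReadyReplicas     int32
	AvailableReplicas int32
}

type condition struct {
	Type   string
	Status string // "True" / "False"
	Reason string
}

const (
	etcdRecoveryActive      = "EtcdRecoveryActive"
	reasonAsExpected        = "AsExpected"
	reasonRecoveryJobFailed = "EtcdRecoveryJobFailed"
)

// fullyAvailable reports whether the etcd StatefulSet has all desired
// replicas ready and available (3/3 for a highly available control plane).
func fullyAvailable(desired int32, st statefulSetStatus) bool {
	return st.ReadyReplicas == desired && st.AvailableReplicas == desired
}

// clearStaleFailure clears the failure condition only when its prior reason
// was the recovery-job failure, mirroring the second check in the PR.
func clearStaleFailure(conds []condition) []condition {
	for i, c := range conds {
		if c.Type == etcdRecoveryActive && c.Reason == reasonRecoveryJobFailed {
			conds[i] = condition{Type: etcdRecoveryActive, Status: "False", Reason: reasonAsExpected}
		}
	}
	return conds
}

func main() {
	conds := []condition{{Type: etcdRecoveryActive, Status: "False", Reason: reasonRecoveryJobFailed}}
	if fullyAvailable(3, statefulSetStatus{ReadyReplicas: 3, AvailableReplicas: 3}) {
		conds = clearStaleFailure(conds)
	}
	fmt.Println(conds[0].Reason) // AsExpected once etcd is healthy again
}
```

In the real reconciler the first check additionally deletes the failed recovery Job and its related objects before clearing the condition; that cleanup is omitted here.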

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-84577

Special notes for your reviewer:

  • The etcd recovery feature is gated behind the ENABLE_ETCD_RECOVERY env var and only applies to managed, highly available etcd clusters.
  • Both fix paths were verified on a live KubeVirt HCP cluster by simulating the stale condition and confirming it gets cleared.
  • The etcd_manual_intervention_required metric (which reads this condition) will also correctly reset.
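The metric reset follows mechanically from the condition: a gauge computed from the condition's reason drops back to 0 once the stale reason is cleared. The helper below is a hedged sketch of that mapping based only on the PR text; the real metric wiring lives elsewhere in HyperShift's metrics code and is not shown here.

```go
package main

import "fmt"

// manualInterventionGauge returns the value the etcd_manual_intervention_required
// gauge would report for a given condition reason. The 1/0 mapping is an
// assumption inferred from the PR description, not the actual metrics code.
func manualInterventionGauge(reason string) float64 {
	if reason == "EtcdRecoveryJobFailed" {
		return 1
	}
	return 0
}

func main() {
	fmt.Println(manualInterventionGauge("EtcdRecoveryJobFailed")) // reports 1 while the stale reason persists
	fmt.Println(manualInterventionGauge("AsExpected"))            // reports 0 after the fix clears it
}
```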

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Bug Fixes

    • Automatically cleans up etcd recovery resources and clears the recovery-active failure state when the etcd cluster reports full readiness, avoiding unnecessary manual-intervention alerts while still signaling failures when the cluster is unhealthy.
    • Clears stale recovery failure conditions when there is no failing etcd pod and the cluster is fully available.
  • Tests

    • Added unit tests covering recovery success/failure paths, condition clearing, and cleanup behavior.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 4, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 4, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 4, 2026
@openshift-ci-robot

@vsolanki12: This pull request references Jira Issue OCPBUGS-84577, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Contributor

coderabbitai Bot commented May 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

When an existing etcd recovery Job finishes unsuccessfully, the reconciler now fetches the etcd StatefulSet; if the StatefulSet reports ReadyReplicas==3 and AvailableReplicas==3 it deletes the recovery Job and related objects, clears the EtcdRecoveryActive condition (Status=False, Reason=AsExpectedReason) on the HostedCluster, updates HostedCluster status, and returns without marking manual-intervention. If the StatefulSet is not fully ready, the reconciler retains the prior failure handling (EtcdRecoveryJobFailedReason). Separately, when no failing etcd pod is detected and the StatefulSet is fully available, the reconciler clears a stale EtcdRecoveryActive failure condition only if its prior Reason was EtcdRecoveryJobFailedReason, updating status. A unit test verifies these behaviors.

Sequence Diagram

sequenceDiagram
    participant Reconciler
    participant ETCDJob as ETCD Recovery Job
    participant ETCDSet as ETCD StatefulSet
    participant HostedCluster as HostedCluster Status

    Reconciler->>ETCDJob: Check if Job exists and failed
    alt Job Failed
        Reconciler->>ETCDSet: Fetch StatefulSet readiness
        alt ReadyReplicas==3 && AvailableReplicas==3
            Reconciler->>ETCDJob: Delete recovery Job and objects
            Reconciler->>HostedCluster: Set EtcdRecoveryActive=False (Reason: AsExpectedReason)
            Reconciler->>HostedCluster: Update status
        else StatefulSet Not Fully Ready
            Reconciler->>HostedCluster: Set EtcdRecoveryActive=True/Failed (Reason: EtcdRecoveryJobFailedReason)
            Reconciler->>HostedCluster: Update status
        end
    else No Failing Pod Detected
        Reconciler->>ETCDSet: Check if fully available
        alt StatefulSet Fully Available
            Reconciler->>HostedCluster: If prior Reason==EtcdRecoveryJobFailedReason, clear EtcdRecoveryActive (Status=False, Reason=AsExpectedReason)
            Reconciler->>HostedCluster: Update status
        end
    end
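The branches in the diagram above condense into a small decision function. The sketch below is illustrative only: the function name, parameters, and returned action strings are invented for clarity and do not match the reconciler's real API, and the other production paths (such as launching a recovery job for a failing pod) are deliberately collapsed into a no-op.

```go
package main

import "fmt"

// recoveryAction mirrors the sequence diagram's branches.
// Names and return values are illustrative; paths not covered by this PR
// (e.g. starting a recovery job for a failing pod) are folded into "no-op".
func recoveryAction(jobFailed, failingPod, fullyReady bool) string {
	if jobFailed {
		if fullyReady {
			return "delete-job-and-clear-condition" // etcd self-healed despite the failed job
		}
		return "mark-recovery-job-failed" // manual intervention still needed
	}
	if !failingPod && fullyReady {
		return "clear-stale-condition-if-job-failed-reason"
	}
	return "no-op"
}

func main() {
	fmt.Println(recoveryAction(true, false, true))  // delete-job-and-clear-condition
	fmt.Println(recoveryAction(true, false, false)) // mark-recovery-job-failed
	fmt.Println(recoveryAction(false, false, true)) // clear-stale-condition-if-job-failed-reason
}
```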
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Tests lack assertion messages: 5 of 6 key assertions are missing the failure messages needed to diagnose failures, violating requirement #4 for meaningful diagnostic messages. Resolution: add descriptive failure messages to all assertions (the err check, the client.Get check, the condition existence checks, and the condition reason check). Example: g.Expect(err).ToNot(HaveOccurred(), "reconcileETCDMemberRecovery failed unexpectedly")
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and specifically describes the main change: clearing a stale EtcdRecoveryActive failure condition when etcd becomes healthy, which matches the core fix implemented in the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in TestReconcileETCDMemberRecovery are stable and deterministic with no dynamic content like pod names, timestamps, UUIDs, or IP addresses. Test names are descriptive and static.
Microshift Test Compatibility ✅ Passed The new test TestReconcileETCDMemberRecovery is a standard Go unit test (testing.T), not a Ginkgo e2e test. The check applies only to Ginkgo e2e tests, so this check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests added. The new test is a standard Go unit test using testing.T, not Ginkgo. SNO compatibility check not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed No scheduling constraints. Controller monitoring logic only, gated by HighlyAvailable topology. No affinity, node selectors, or topology spreads.
Ote Binary Stdout Contract ✅ Passed Check not applicable. Modified files are HyperShift controller code and unit tests, not OTE test extension binaries. No stdout writes found.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed Check is not applicable. The PR adds TestReconcileETCDMemberRecovery, a standard Go unit test using fake clients, not a Ginkgo e2e test. The custom check applies only to Ginkgo e2e tests.



@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels May 4, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vsolanki12
Once this PR has been reviewed and has the lgtm label, please assign sjenning for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

@vsolanki12 vsolanki12 changed the title fix(etcd-recovery): OCPBUGS-84577: clear stale EtcdRecoveryActive failure condition when etcd is healthy OCPBUGS-84577: clear stale EtcdRecoveryActive failure condition when etcd is healthy May 4, 2026
@codecov

codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 55.55556% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 39.88%. Comparing base (96a2bcb) to head (d34dc0d).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
...perator/controllers/hostedcluster/etcd_recovery.go 55.55% 8 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8406      +/-   ##
==========================================
+ Coverage   39.85%   39.88%   +0.02%     
==========================================
  Files         751      751              
  Lines       92554    92582      +28     
==========================================
+ Hits        36889    36922      +33     
+ Misses      52992    52976      -16     
- Partials     2673     2684      +11     
Files with missing lines Coverage Δ
...perator/controllers/hostedcluster/etcd_recovery.go 43.03% <55.55%> (+7.68%) ⬆️

... and 1 file with indirect coverage changes

Flag Coverage Δ
cmd-support 34.03% <ø> (ø)
cpo-hostedcontrolplane 40.52% <ø> (-0.02%) ⬇️
cpo-other 40.08% <ø> (ø)
hypershift-operator 50.46% <55.55%> (+0.08%) ⬆️
other 30.70% <ø> (ø)


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (1)

6622-6627: ⚡ Quick win

Assert the failed recovery job is gone in the cleanup case.

The recovered-after-failed-job case only checks the condition reason. If cleanup regresses and the stale Job stays behind, this test still passes even though one of the main behaviors in this PR is broken.

Suggested assertion
 	testCases := []struct {
 		name            string
 		objects         []crclient.Object
 		conditions      []metav1.Condition
 		expectedReason  string
 		conditionExists bool
+		expectJobDeleted bool
 	}{
 		{
 			name:            "When failed job exists but etcd recovered it should cleanup job and clear condition",
 			conditions:      []metav1.Condition{staleCondition},
 			objects:         append(healthyEtcdPods(), healthyStatefulSet, failedJob),
 			expectedReason:  hyperv1.AsExpectedReason,
 			conditionExists: true,
+			expectJobDeleted: true,
 		},
 	}
@@
 			if tc.conditionExists {
 				g.Expect(condition).ToNot(BeNil())
 				g.Expect(condition.Reason).To(Equal(tc.expectedReason))
 			} else {
 				g.Expect(condition).To(BeNil())
 			}
+
+			if tc.expectJobDeleted {
+				job := etcdrecoverymanifests.EtcdRecoveryJob(hcpNS)
+				err := client.Get(t.Context(), crclient.ObjectKeyFromObject(job), job)
+				g.Expect(errors2.IsNotFound(err)).To(BeTrue())
+			}
 		})
 	}
 }

Also applies to: 6677-6686

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go`
around lines 6622 - 6627, The test cases that use the failedJob fixture (the
case with name "When failed job exists but etcd recovered it should cleanup job
and clear condition" and the similar case around lines 6677-6686) only assert
condition reasons; add an assertion after reconciliation that the failed Job
(failedJob) has been removed—e.g., attempt to Get the Job by failedJob.Name in
the test namespace and assert the client returns NotFound or that listing Jobs
returns zero matching entries. Locate the assertions around
expectedReason/conditionExists in hostedcluster_controller_test.go and add the
cleanup check for failedJob for both test cases so a lingering Job causes the
test to fail.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d69011b7-ad73-45d5-aa19-6f7fc30bdb73

📥 Commits

Reviewing files that changed from the base of the PR and between 68106f0 and 7f5e921.

📒 Files selected for processing (4)
  • control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go
  • hypershift-operator/controllers/hostedcluster/etcd_recovery.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go

Comment thread on hypershift-operator/controllers/hostedcluster/etcd_recovery.go (outdated)
@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 4, 2026
@openshift-ci-robot

@vsolanki12: This pull request references Jira Issue OCPBUGS-84577, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.


@vsolanki12 vsolanki12 force-pushed the OCPBUGS-84577-etcd-recovery-stale-condition branch from d28aec7 to a9c44fb on May 4, 2026 at 13:57
@openshift-ci-robot

@vsolanki12: This pull request references Jira Issue OCPBUGS-84577, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

@vsolanki12 vsolanki12 force-pushed the OCPBUGS-84577-etcd-recovery-stale-condition branch 2 times, most recently from 9462d10 to 5873d33 on May 4, 2026 at 14:07
@vsolanki12
Contributor Author

/auto-cc

@openshift-ci openshift-ci Bot requested review from bryan-cox and sdminonne May 5, 2026 05:34
@vsolanki12 vsolanki12 marked this pull request as ready for review May 5, 2026 09:42
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026
@openshift-ci openshift-ci Bot requested a review from Nirshal May 5, 2026 09:42
@vsolanki12
Contributor Author

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai Bot commented May 5, 2026

✅ Actions performed

Full review triggered.

@vsolanki12 vsolanki12 force-pushed the OCPBUGS-84577-etcd-recovery-stale-condition branch from 5873d33 to cafae49 on May 5, 2026 at 10:24
@vsolanki12
Contributor Author

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai Bot commented May 5, 2026

✅ Actions performed

Full review triggered.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
hypershift-operator/controllers/hostedcluster/etcd_recovery.go (1)

65-66: ⚡ Quick win

Avoid hardcoding the expected etcd replica count to 3.

Both health checks gate behavior on ReadyReplicas == 3 && AvailableReplicas == 3. Using the StatefulSet desired replicas keeps this logic correct if replica sizing changes in the future.

♻️ Suggested change
-} else if etcdStatefulSet.Status.ReadyReplicas == 3 && etcdStatefulSet.Status.AvailableReplicas == 3 {
+} else {
+	expectedReplicas := int32(1)
+	if etcdStatefulSet.Spec.Replicas != nil {
+		expectedReplicas = *etcdStatefulSet.Spec.Replicas
+	}
+	if etcdStatefulSet.Status.ReadyReplicas == expectedReplicas && etcdStatefulSet.Status.AvailableReplicas == expectedReplicas {
 		log.Info("etcd recovered despite failed recovery job, cleaning up")
 		...
 		return nil, nil
+	}
 }
-fullyAvailable := etcdStatefulSet.Status.ReadyReplicas == 3 && etcdStatefulSet.Status.AvailableReplicas == 3
+expectedReplicas := int32(1)
+if etcdStatefulSet.Spec.Replicas != nil {
+	expectedReplicas = *etcdStatefulSet.Spec.Replicas
+}
+fullyAvailable := etcdStatefulSet.Status.ReadyReplicas == expectedReplicas &&
+	etcdStatefulSet.Status.AvailableReplicas == expectedReplicas

Also applies to: 125-126
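The suggested change boils down to a small helper. A self-contained sketch follows, using a local stand-in struct instead of the real appsv1.StatefulSet (whose Spec.Replicas is a *int32 that the StatefulSet API defaults to 1 when unset); names here are illustrative, not the reviewer's exact code.

```go
package main

import "fmt"

// statefulSet is a local stand-in for appsv1.StatefulSet carrying only the
// fields the health check reads.
type statefulSet struct {
	SpecReplicas      *int32 // nil means the API default of 1
	ReadyReplicas     int32
	AvailableReplicas int32
}

// desiredReplicas dereferences Spec.Replicas, defaulting to 1 as the
// StatefulSet API does when the field is unset.
func desiredReplicas(sts statefulSet) int32 {
	if sts.SpecReplicas != nil {
		return *sts.SpecReplicas
	}
	return 1
}

// fullyAvailable replaces the hardcoded "== 3" checks with the desired count,
// so the logic stays correct if replica sizing changes.
func fullyAvailable(sts statefulSet) bool {
	n := desiredReplicas(sts)
	return sts.ReadyReplicas == n && sts.AvailableReplicas == n
}

func main() {
	three := int32(3)
	fmt.Println(fullyAvailable(statefulSet{SpecReplicas: &three, ReadyReplicas: 3, AvailableReplicas: 3})) // true
	fmt.Println(fullyAvailable(statefulSet{ReadyReplicas: 1, AvailableReplicas: 1}))                       // true: nil replicas defaults to 1
}
```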

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/hostedcluster/etcd_recovery.go` around lines
65 - 66, Replace the hardcoded "3" replica checks with the StatefulSet's desired
replica count: read the intended replica value from
etcdStatefulSet.Spec.Replicas (dereference the *int32) and use that for
comparisons instead of 3 in the blocks that check
etcdStatefulSet.Status.ReadyReplicas and
etcdStatefulSet.Status.AvailableReplicas (the checks around etcdStatefulSet). Do
the same replacement for the similar check later in the file (the other
ReadyReplicas/AvailableReplicas condition).
hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (1)

6719-6730: ⚡ Quick win

Harden assertions to avoid false positives and brittle name-coupling.

The test currently keys a cleanup assertion off a case-name string and only checks Reason. Add explicit per-case flags (e.g., expectJobDeleted) and assert condition.Status as well to make failures more precise.

✅ Suggested test tightening
 testCases := []struct {
     name            string
     objects         []crclient.Object
     conditions      []metav1.Condition
     expectedReason  string
     conditionExists bool
+    expectJobDeleted bool
 }{
   {
     name:            "When failed job exists but etcd recovered it should cleanup job and clear condition",
     ...
     conditionExists: true,
+    expectJobDeleted: true,
   },
 }

 ...

 if tc.conditionExists {
     g.Expect(condition).ToNot(BeNil())
+    g.Expect(condition.Status).To(Equal(metav1.ConditionFalse))
     g.Expect(condition.Reason).To(Equal(tc.expectedReason))
 } else {
     g.Expect(condition).To(BeNil())
 }
-if tc.name == "When failed job exists but etcd recovered it should cleanup job and clear condition" {
+if tc.expectJobDeleted {
     job := etcdrecoverymanifests.EtcdRecoveryJob(hcpNS)
     err := client.Get(t.Context(), crclient.ObjectKeyFromObject(job), job)
     g.Expect(errors2.IsNotFound(err)).To(BeTrue(), "expected failed recovery job to be deleted")
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go`
around lines 6719 - 6730, The test currently couples a job-deletion assertion to
the case name and only checks condition.Reason; add a boolean test-case field
(e.g., tc.expectJobDeleted) and use that flag instead of comparing tc.name to
decide whether to fetch etcdrecoverymanifests.EtcdRecoveryJob and assert
deletion via client.Get; also tighten the condition check by asserting
condition.Status equals an explicit tc.expectedStatus (in addition to
tc.expectedReason) when tc.conditionExists, referencing the existing
meta.FindStatusCondition lookup for hyperv1.EtcdRecoveryActive and the
tc.conditionExists / tc.expectedReason symbols to locate where to add the new
assertions.


📥 Commits

Reviewing files that changed from the base of the PR and between 0ca85b1 and cafae49.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/hostedcluster/etcd_recovery.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go

@vsolanki12 vsolanki12 force-pushed the OCPBUGS-84577-etcd-recovery-stale-condition branch 2 times, most recently from cc510cb to a2a7e34 on May 5, 2026 at 11:54
@vsolanki12
Contributor Author

/retest

@vsolanki12
Contributor Author

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai Bot commented May 7, 2026

✅ Actions performed

Full review triggered.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (1)

6640-6677: ⚡ Quick win

Add a regression case for StatefulSet read failure in the failed-job path.

Current cases validate healthy/unhealthy outcomes, but not the new non-NotFound error branch (production lines 61-64 in reconcileETCDMemberRecovery). A dedicated case would lock in that behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go`
around lines 6640 - 6677, Add a regression test case in the testCases slice that
covers the non-NotFound StatefulSet read error path in
reconcileETCDMemberRecovery: include failedJob and the etcd pod objects (e.g.,
use healthyEtcdPods()), ensure the test harness simulates a GET error for the
StatefulSet (not a NotFound) instead of returning
unhealthyStatefulSet/healthyStatefulSet, and assert the behavior you expect for
that branch (e.g., expectedReason remains hyperv1.EtcdRecoveryJobFailedReason
and expectJobDeleted is false). Reference the existing identifiers failedJob,
reconcileETCDMemberRecovery, healthyEtcdPods,
unhealthyStatefulSet/healthyStatefulSet and add the new case to the testCases
array so the fake client error path is exercised.


📥 Commits

Reviewing files that changed from the base of the PR and between 6b39d47 and a2a7e34.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/hostedcluster/etcd_recovery.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go

@vsolanki12 vsolanki12 force-pushed the OCPBUGS-84577-etcd-recovery-stale-condition branch from a2a7e34 to 01953c1 on May 7, 2026 at 15:21
Contributor

@sdminonne sdminonne left a comment


Could you follow up?

etcdRecoveryActiveCondition.Status = metav1.ConditionFalse
etcdRecoveryActiveCondition.Reason = hyperv1.AsExpectedReason
etcdRecoveryActiveCondition.LastTransitionTime = r.now()
meta.SetStatusCondition(&hcluster.Status.Conditions, etcdRecoveryActiveCondition)
Contributor


What about etcdRecoveryActiveCondition.Message in this path?

Contributor Author


Thanks, I have made all the suggested changes in the etcd_recovery.go & hostedcluster_controller_test.go files.

Type: string(hyperv1.EtcdRecoveryActive),
Status: metav1.ConditionFalse,
Reason: hyperv1.AsExpectedReason,
ObservedGeneration: hcluster.Generation,
Contributor


What about etcdRecoveryActiveCondition.Message in this path?

Contributor Author


Thanks, I have made all the suggested changes in the etcd_recovery.go & hostedcluster_controller_test.go files.

var pods []crclient.Object
for i := 0; i < 3; i++ {
pods = append(pods, &corev1.Pod{
ObjectMeta: metav1.ObjectMeta{
Contributor


Could you add a case where RestartCount is > 0 and the container is in a Running state?
To cover the case when the pod recovers.

Contributor Author


Thanks, I have made all the suggested changes in the etcd_recovery.go & hostedcluster_controller_test.go files.

@sdminonne
Contributor

/cc @sdminonne

@vsolanki12
Contributor Author

/retest

@vsolanki12
Contributor Author

/verified by me in local env

@openshift-ci-robot openshift-ci-robot added the verified label (Signifies that the PR passed pre-merge verification criteria) on May 11, 2026
@openshift-ci-robot

@vsolanki12: This PR has been marked as verified by me in local env.

Details

In response to this:

/verified by me in local env

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@vsolanki12
Contributor Author

Supporting logs:
Before fix:
{ "type": "EtcdRecoveryActive", "status": "False", "reason": "EtcdRecoveryJobFailed", "message": "Error in Etcd Recovery job: the Etcd cluster requires manual intervention.", "lastTransitionTime": "..." }

After fix:
{ "type": "EtcdRecoveryActive", "status": "False", "reason": "AsExpected", "lastTransitionTime": "..." }

@vsolanki12
Contributor Author

/test e2e-v2-aws

@vsolanki12 vsolanki12 force-pushed the OCPBUGS-84577-etcd-recovery-stale-condition branch from 7788d6b to 6e9192b on May 12, 2026 at 07:10
@openshift-ci-robot openshift-ci-robot removed the verified label (Signifies that the PR passed pre-merge verification criteria) on May 12, 2026
@hypershift-jira-solve-ci

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

--- FAIL: TestReconcileETCDMemberRecovery (0.01s)
    --- FAIL: TestReconcileETCDMemberRecovery/When_etcd_is_healthy_and_stale_EtcdRecoveryJobFailed_condition_exists_it_should_clear_the_condition (0.00s)
        hostedcluster_controller_test.go:6920:
            Expected
                <string>: EtcdRecoveryJobFailed
            to equal
                <string>: AsExpected
    --- FAIL: TestReconcileETCDMemberRecovery/When_failed_job_exists_but_etcd_recovered_it_should_cleanup_job_and_clear_condition (0.00s)
        hostedcluster_controller_test.go:6920:
            Expected
                <string>: EtcdRecoveryJobFailed
            to equal
                <string>: AsExpected
    --- FAIL: TestReconcileETCDMemberRecovery/When_etcd_pods_have_restarted_but_recovered_it_should_clear_the_stale_condition (0.00s)
        hostedcluster_controller_test.go:6920:
            Expected
                <string>: EtcdRecoveryJobFailed
            to equal
                <string>: AsExpected

Summary

The codecov/project check failed because the "Unit Tests (hypershift-operator)" GitHub Actions job failed, which caused the "Upload to Codecov" step to be skipped. Without coverage data for the hypershift-operator package (85 files, ~24,000 lines), Codecov reported a -3.68% coverage drop. The underlying unit test failure is a bug introduced by the PR itself: the setEtcdRecoveryCondition function silently skips the status update when only the Reason changes but the Status field remains the same, so the stale EtcdRecoveryJobFailed reason is never overwritten to AsExpected.

Root Cause

The root cause is a logic bug in setEtcdRecoveryCondition() at line 121 of etcd_recovery.go. The guard condition only triggers an update when the condition's Status field changes:

if oldCondition == nil || oldCondition.Status != condition.Status {
    meta.SetStatusCondition(&hcluster.Status.Conditions, condition)
    // ...
}

The PR adds logic to clear a stale EtcdRecoveryActive condition by setting it to Status=ConditionFalse with Reason=AsExpected. However, the stale condition already has Status=ConditionFalse (with Reason=EtcdRecoveryJobFailed). Since the Status field is identical (ConditionFalse == ConditionFalse), the guard clause short-circuits and the Reason is never updated.

This affects all three failing test subtests:

  1. When_etcd_is_healthy_and_stale_EtcdRecoveryJobFailed_condition_exists — enters detectAndTriggerEtcdRecovery, finds no failing pods, etcd is fully available, detects stale condition, calls setEtcdRecoveryCondition(ConditionFalse, AsExpected) — skipped because Status unchanged.
  2. When_failed_job_exists_but_etcd_recovered — enters handleExistingEtcdRecoveryJob, job failed but etcd has 3/3 replicas, calls setEtcdRecoveryCondition(ConditionFalse, AsExpected) — skipped because Status unchanged.
  3. When_etcd_pods_have_restarted_but_recovered — same path as #1, through detectAndTriggerEtcdRecovery.

The codecov/project failure is a cascading consequence: the TestReconcileETCDMemberRecovery failures cause the entire hypershift-operator test shard to exit non-zero → the "Upload to Codecov" step is skipped → Codecov sees 85 fewer files → reports a -3.68% coverage drop → marks the project check as failed.

Recommendations
  1. Fix the setEtcdRecoveryCondition guard clause to also trigger an update when the Reason changes, not just the Status. For example:

    if oldCondition == nil || oldCondition.Status != condition.Status || oldCondition.Reason != condition.Reason {
        meta.SetStatusCondition(&hcluster.Status.Conditions, condition)
        if err := r.Client.Status().Update(ctx, hcluster); err != nil {
            return fmt.Errorf("failed to update etcd recovery job condition: %w", err)
        }
    }

    Alternatively, compare the Message as well if message changes should also trigger updates, or simplify by always calling meta.SetStatusCondition (which already performs idempotency checks internally and only updates LastTransitionTime when Status changes).

  2. Consider removing the guard entirely and relying on meta.SetStatusCondition's built-in behavior, which handles LastTransitionTime updates correctly. The Status().Update() call is cheap when nothing changes on the server side, and removing the guard eliminates this class of bugs.

  3. Re-trigger the CI after the fix — the codecov/project failure will self-resolve once the unit tests pass and coverage data is uploaded.
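The failure mode and the proposed fix in recommendation 1 can be reproduced with a stdlib-only sketch (setCondition and its checkReason flag are illustrative stand-ins for setEtcdRecoveryCondition, not the real function):

```go
package main

import "fmt"

type condition struct {
	Status string
	Reason string
}

// setCondition mimics the guard in setEtcdRecoveryCondition. With
// checkReason=false it reproduces the buggy Status-only guard; with
// checkReason=true it applies the recommended fix.
func setCondition(old *condition, next condition, checkReason bool) condition {
	if old == nil || old.Status != next.Status || (checkReason && old.Reason != next.Reason) {
		return next // update applied
	}
	return *old // update skipped
}

func main() {
	stale := condition{Status: "False", Reason: "EtcdRecoveryJobFailed"}
	want := condition{Status: "False", Reason: "AsExpected"}

	// Buggy guard: Status is unchanged, so the stale Reason survives.
	fmt.Println(setCondition(&stale, want, false).Reason) // EtcdRecoveryJobFailed

	// Fixed guard: the Reason difference triggers the update.
	fmt.Println(setCondition(&stale, want, true).Reason) // AsExpected
}
```

This is exactly the transition the failing subtests assert: EtcdRecoveryJobFailed to AsExpected with Status held at ConditionFalse.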

Evidence

- Failing check: codecov/project — conclusion: failure, coverage 36.17% (-3.68%)
- Codecov report: "✅ All modified and coverable lines are covered by tests" — patch coverage passed, project check failed
- Coverage drop cause: "Upload to Codecov" step skipped due to the test failure in the hypershift-operator shard
- Failing GH Actions job: test / Unit Tests (hypershift-operator) — "Run tests" step failed, "Upload to Codecov" skipped
- Failing test: TestReconcileETCDMemberRecovery — 3 of 5 subtests failed
- Assertion failure: Expected <string>: EtcdRecoveryJobFailed to equal <string>: AsExpected (line 6920 of hostedcluster_controller_test.go)
- Bug location: etcd_recovery.go:121 — setEtcdRecoveryCondition guard: oldCondition.Status != condition.Status
- Bug mechanism: guard only checks Status (ConditionFalse == ConditionFalse), skips the Reason update (EtcdRecoveryJobFailed → AsExpected)
- Files changed by PR: etcd_recovery.go, hostedcluster_controller_test.go — both in hypershift-operator/controllers/hostedcluster/
- Missing files in coverage: 85 files, 23,969 lines from the hypershift-operator/ package not uploaded to Codecov

Contributor

@sdminonne sdminonne left a comment


Generally OK (aside from the fact that we're hard-coding healthy etcd replicas = 3). Need to improve coverage (for example in case of errors).
Can we create an isETCDHealthy func instead?
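The helper the reviewer asks for might look like this stdlib-only sketch (the struct types are local stand-ins; the real signature would take *appsv1.StatefulSet and the etcd member pods):

```go
package main

import "fmt"

// Local stand-ins for the Kubernetes objects involved.
type statefulSet struct {
	DesiredReplicas   int32
	ReadyReplicas     int32
	AvailableReplicas int32
}

type pod struct {
	Failing bool
}

// isETCDHealthy: etcd is healthy when the StatefulSet is fully
// ready/available against its own desired replica count and no member
// pod is failing.
func isETCDHealthy(sts statefulSet, pods []pod) bool {
	if sts.ReadyReplicas != sts.DesiredReplicas || sts.AvailableReplicas != sts.DesiredReplicas {
		return false
	}
	for _, p := range pods {
		if p.Failing {
			return false
		}
	}
	return true
}

func main() {
	healthy := statefulSet{DesiredReplicas: 3, ReadyReplicas: 3, AvailableReplicas: 3}
	fmt.Println(isETCDHealthy(healthy, []pod{{}, {}, {}}))
	fmt.Println(isETCDHealthy(healthy, []pod{{}, {Failing: true}}))
}
```

Centralizing the check this way would let both fix paths in reconcileETCDMemberRecovery share one definition of "healthy" instead of duplicating replica comparisons.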

Namespace: hcpNS,
},
Status: appsv1.StatefulSetStatus{
ReadyReplicas: 2,
Contributor


Is this unhealthy 'cause we hardcode etcd spec.replicas = 3? Mind adding the spec then?

Contributor Author


Yes, made the changes as per the suggestions.

if !apierrors.IsNotFound(err) {
return false, fmt.Errorf("failed to get etcd statefulset: %w", err)
}
} else if etcdStatefulSet.Status.ReadyReplicas == 3 && etcdStatefulSet.Status.AvailableReplicas == 3 {
Contributor


This is hardcoding etcdStatefulSet.Status.ReadyReplicas = 3, right?

Contributor Author


Yes, made the changes as per the suggestions.

…when etcd is healthy

When the etcd recovery job fails but etcd self-heals, the
EtcdRecoveryJobFailed condition was never cleared. This caused the
OpenShift Console to display a stale error message even when the
cluster was fully healthy.

This fix adds two checks:
- When a failed recovery job exists but etcd StatefulSet is fully
  available (3/3), clean up the job and clear the condition.
  Transient API errors are propagated instead of silently falling
  through to the failure path.
- When no failing etcd pods exist and etcd is healthy, clear any
  stale EtcdRecoveryJobFailed condition.

Signed-off-by: Vimal Solanki <vsolanki@redhat.com>
@vsolanki12 vsolanki12 force-pushed the OCPBUGS-84577-etcd-recovery-stale-condition branch from 6e9192b to d34dc0d on May 12, 2026 at 13:17
@openshift-ci
Contributor

openshift-ci Bot commented May 12, 2026

@vsolanki12: all tests passed!


Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
